{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "9c8368b2",
   "metadata": {},
   "source": [
    "<a href=\"https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/examples/llama_datasets/uploading_llama_dataset.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1448e661-eaab-4e88-9f37-80566567e677",
   "metadata": {},
   "source": [
    "# Contributing a LlamaDataset To LlamaHub"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ee479cd5-40eb-4e7d-92a8-42202bc700af",
   "metadata": {},
   "source": [
    "`LlamaDataset`'s storage is managed through a git repository. To contribute a dataset requires making a pull request to `llama_index/llama_datasets` Github (LFS) repository. \n",
    "\n",
    "To contribute a `LabelledRagDataset` (a subclass of `BaseLlamaDataset`), two sets of files are required:\n",
    "\n",
    "1. The `LabelledRagDataset` saved as json named `rag_dataset.json`\n",
    "2. Source document files used to create the `LabelledRagDataset`\n",
    "\n",
    "This brief notebook provides a quick example using the Paul Graham Essay text file."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d1ec7dfd",
   "metadata": {},
   "outputs": [],
   "source": [
    "%pip install llama-index-llms-openai"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4639b268-e0d4-40de-af97-198be1c62c62",
   "metadata": {},
   "outputs": [],
   "source": [
    "import nest_asyncio\n",
    "\n",
    "nest_asyncio.apply()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2acea4a6-30a5-45fb-a5e3-f4c0ba013154",
   "metadata": {},
   "source": [
    "### Load Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0784e81f-7845-4d34-a196-9f699356c999",
   "metadata": {},
   "outputs": [],
   "source": [
    "!mkdir -p 'data/paul_graham/'\n",
    "!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7ef0150c-1bce-4474-8fec-8b769aff192c",
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.core import SimpleDirectoryReader\n",
    "\n",
    "# Load documents and build index\n",
    "documents = SimpleDirectoryReader(\n",
    "    input_files=[\"data/paul_graham/paul_graham_essay.txt\"]\n",
    ").load_data()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8de410fc-8ceb-49f2-9b43-06caf3dced31",
   "metadata": {},
   "outputs": [],
   "source": [
    "# generate questions against chunks\n",
    "from llama_index.core.llama_dataset.generator import RagDatasetGenerator\n",
    "from llama_index.llms.openai import OpenAI\n",
    "\n",
    "# set context for llm provider\n",
    "llm_gpt35 = OpenAI(model=\"gpt-4\", temperature=0.3)\n",
    "\n",
    "# instantiate a DatasetGenerator\n",
    "dataset_generator = RagDatasetGenerator.from_documents(\n",
    "    documents,\n",
    "    llm=llm_gpt35,\n",
    "    num_questions_per_chunk=2,  # set the number of questions per nodes\n",
    "    show_progress=True,\n",
    ")\n",
    "\n",
    "rag_dataset = dataset_generator.generate_dataset_from_nodes()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "45a36b15-19c5-4a69-997b-5ded125544e5",
   "metadata": {},
   "source": [
    "Now that we have our `LabelledRagDataset` generated (btw, it's totally fine to manually create one with human generated queries and reference answers!), we store this into the necessary json file."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e34c2d0d-caf6-4960-bb37-6abd57b1ba4e",
   "metadata": {},
   "outputs": [],
   "source": [
    "rag_dataset.save_json(\"rag_dataset.json\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e5088c85-6b56-4cf7-a93e-d0ba5880dfb3",
   "metadata": {},
   "source": [
    "#### Generating Baseline Results\n",
    "\n",
    "In addition to adding just a `LlamaDataset`, we also encourage adding baseline benchmarks for others to use as sort of measuring stick against their own RAG pipelines."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b366e286-39fb-487e-8ba4-1835352a0a94",
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.core import VectorStoreIndex\n",
    "\n",
    "# a basic RAG pipeline, uses defaults\n",
    "index = VectorStoreIndex.from_documents(documents=documents)\n",
    "query_engine = index.as_query_engine()\n",
    "\n",
    "# manually\n",
    "prediction_dataset = await rag_dataset.amake_predictions_with(\n",
    "    query_engine=query_engine, show_progress=True\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "db8640fc-9b94-4f22-b831-411b8f95d275",
   "metadata": {},
   "source": [
    "## Submitting The Pull-Requests"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "093e63cc-f244-42ec-bbc4-b67f104a2a51",
   "metadata": {},
   "source": [
    "With the `rag_dataset.json` and source file `paul_graham_essay.txt` (note in this case, there is only one source document, but there can be several), we can perform the two steps for contributing a `LlamaDataset` into `LlamaHub`:\n",
    "\n",
    "1. Similar, to how contributions are made for `loader`'s, `agent`'s and `pack`'s, create a pull-request for `llama_hub` repository that adds a new folder for new `LlamaDataset`. This step uploads the information about the new `LlamaDataset` so that it can be presented in the `LlamaHub` UI.\n",
    "\n",
    "2. Create a pull request into `llama_datasets` repository to actually upload the data files."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "539c8f2f-bf6f-4a6d-b53f-0c8ca85fc891",
   "metadata": {},
   "source": [
    "### Step 0 (Pre-requisites)\n",
    "\n",
    "Fork and then clone (onto your local machine) both, the `llama_hub` Github repository as well as the `llama_datasets` one. You'll be submitting a pull requests into both of these repos from a new branch off of your forked versions."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "611a8a75-1aaf-479a-94d0-a9127f08fe12",
   "metadata": {},
   "source": [
    "### Step 1\n",
    "\n",
    "Create a new folder in `llama_datasets/` of the `llama_hub` Github repository. For example, in this case we would create a new folder `llama_datasets/paul_graham_essay`.\n",
    "\n",
    "In that folder, two files are required:\n",
    "- `card.json`\n",
    "- `README.md`\n",
    "\n",
    "In particular, on your local machine:\n",
    "\n",
    "```\n",
    "cd llama_datasets/\n",
    "mkdir paul_graham_essay\n",
    "touch card.json\n",
    "touch README.md\n",
    "```\n",
    "\n",
    "The suggestion here is to look at previously submitted `LlamaDataset`'s and modify their respective files as needed for your new dataset."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3ee08d0e-2601-4978-bdb2-e28375454d7f",
   "metadata": {},
   "source": [
    "In our current example, we need the `card.json` to look as follows\n",
    "\n",
    "```json\n",
    "{\n",
    "    \"name\": \"Paul Graham Essay\",\n",
    "    \"description\": \"A labelled RAG dataset based off an essay by Paul Graham, consisting of queries, reference answers, and reference contexts.\",\n",
    "    \"numberObservations\": 44,\n",
    "    \"containsExamplesByHumans\": false,\n",
    "    \"containsExamplesByAI\": true,\n",
    "    \"sourceUrls\": [\n",
    "        \"http://www.paulgraham.com/articles.html\"\n",
    "    ],\n",
    "    \"baselines\": [\n",
    "        {\n",
    "            \"name\": \"llamaindex\",\n",
    "            \"config\": {\n",
    "                \"chunkSize\": 1024,\n",
    "                \"llm\": \"gpt-3.5-turbo\",\n",
    "                \"similarityTopK\": 2,\n",
    "                \"embedModel\": \"text-embedding-ada-002\"\n",
    "            },\n",
    "            \"metrics\": {\n",
    "                \"contextSimilarity\": 0.934,\n",
    "                \"correctness\": 4.239,\n",
    "                \"faithfulness\": 0.977,\n",
    "                \"relevancy\": 0.977\n",
    "            },\n",
    "            \"codeUrl\": \"https://github.com/run-llama/llama_datasets/blob/main/baselines/paul_graham_essay/llamaindex_baseline.py\"\n",
    "        }\n",
    "    ]\n",
    "}\n",
    "```"
   ]
  },
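  {
   "cell_type": "markdown",
   "id": "card-fields-sketch",
   "metadata": {},
   "source": [
    "Several `card.json` fields describe the dataset itself, so they can be read off the saved file rather than counted by hand. The sketch below is one way to do that. It assumes `rag_dataset.json` holds a top-level `examples` list whose items carry `query_by` and `reference_answer_by` objects with a `type` of `ai` or `human`. That layout is an assumption, so check your own file first. `summarize_rag_dataset` is a hypothetical helper, not part of `llama_index`.\n",
    "\n",
    "```python\n",
    "import json\n",
    "from pathlib import Path\n",
    "\n",
    "\n",
    "def summarize_rag_dataset(path):\n",
    "    # Assumed layout: {\"examples\": [{\"query_by\": {\"type\": \"ai\"}, ...}, ...]}\n",
    "    examples = json.loads(Path(path).read_text())[\"examples\"]\n",
    "    creator_types = set()\n",
    "    for example in examples:\n",
    "        for key in (\"query_by\", \"reference_answer_by\"):\n",
    "            created_by = example.get(key) or {}  # the field may be null\n",
    "            if created_by.get(\"type\"):\n",
    "                creator_types.add(created_by[\"type\"])\n",
    "    return {\n",
    "        \"numberObservations\": len(examples),\n",
    "        \"containsExamplesByHumans\": \"human\" in creator_types,\n",
    "        \"containsExamplesByAI\": \"ai\" in creator_types,\n",
    "    }\n",
    "```\n",
    "\n",
    "Calling `summarize_rag_dataset(\"rag_dataset.json\")` returns a dict whose values can be copied into the card."
   ]
  },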
  {
   "cell_type": "markdown",
   "id": "2c332954-560a-4769-be57-787089add4d5",
   "metadata": {},
   "source": [
    "And for the `README.md`, these are pretty standard, requiring you to change the name of the dataset argument in the `download_llama_dataset()` function call.\n",
    "\n",
    "```python\n",
    "from llama_index.llama_datasets import download_llama_datasets\n",
    "from llama_index.llama_pack import download_llama_pack\n",
    "from llama_index import VectorStoreIndex\n",
    "\n",
    "# download and install dependencies for rag evaluator pack\n",
    "RagEvaluatorPack = download_llama_pack(\n",
    "  \"RagEvaluatorPack\", \"./rag_evaluator_pack\"\n",
    ")\n",
    "rag_evaluator_pack = RagEvaluatorPack()\n",
    "\n",
    "# download and install dependencies for benchmark dataset\n",
    "rag_dataset, documents = download_llama_datasets(\n",
    "  \"PaulGrahamEssayTruncatedDataset\", \"./data\"\n",
    ")\n",
    "\n",
    "# evaluate\n",
    "query_engine = VectorStoreIndex.as_query_engine()  # previously defined, not shown here\n",
    "rag_evaluate_pack.run(dataset=paul_graham_qa_data, query_engine=query_engine)\n",
    "```\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "086f63ab-06f7-4a52-8271-9031530123e6",
   "metadata": {},
   "source": [
    "Finally, the last item for Step 1 is to create an entry to `llama_datasets/library.json` file. In this case:\n",
    "\n",
    "```json\n",
    "    ...,\n",
    "    \"PaulGrahamEssayDataset\": {\n",
    "    \"id\": \"llama_datasets/paul_graham_essay\",\n",
    "    \"author\": \"andrei-fajardo\",\n",
    "    \"keywords\": [\"rag\"],\n",
    "    \"extra_files\": [\"paul_graham_essay.txt\"]\n",
    "  }\n",
    "```\n",
    "\n",
    "Note: the `extra_files` field is reserved for the source files."
   ]
  },
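  {
   "cell_type": "markdown",
   "id": "library-entry-sketch",
   "metadata": {},
   "source": [
    "If you would rather script the `library.json` update than edit the file by hand, something along these lines should work. It assumes `library.json` is a single JSON object keyed by dataset name, as the snippet above suggests. `add_library_entry` and its parameters are hypothetical names introduced for illustration, and rewriting the file this way may change its existing indentation.\n",
    "\n",
    "```python\n",
    "import json\n",
    "from pathlib import Path\n",
    "\n",
    "\n",
    "def add_library_entry(library_path, name, dataset_id, author, extra_files):\n",
    "    # Assumed layout: one JSON object mapping dataset names to entries.\n",
    "    path = Path(library_path)\n",
    "    library = json.loads(path.read_text())\n",
    "    library[name] = {\n",
    "        \"id\": dataset_id,\n",
    "        \"author\": author,\n",
    "        \"keywords\": [\"rag\"],\n",
    "        \"extra_files\": list(extra_files),  # the source document files\n",
    "    }\n",
    "    path.write_text(json.dumps(library, indent=2))\n",
    "    return library[name]\n",
    "```\n",
    "\n",
    "For this example, the call would pass `\"PaulGrahamEssayDataset\"` as `name` and `\"llama_datasets/paul_graham_essay\"` as `dataset_id`."
   ]
  },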
  {
   "cell_type": "markdown",
   "id": "0ec3d5a7-6ba8-40fa-bed6-a95d6c8813d6",
   "metadata": {},
   "source": [
    "### Step 2 Uploading The Actual Datasets\n",
    "\n",
    "In this step, since we use Github LFS on our `llama_datasets` repo, making a contribution is exactly the same way you would make a contribution with any of our other open Github repos. That is, submit a pull request.\n",
    "\n",
    "Make a fork of the `llama_datasets` repo, and create a new folder in the `llama_datasets/` directory that matches the `id` field of the entry made in the `library.json` file. So, for this example, we'll create a new folder `llama_datasets/paul_graham_essay/`. It is here where we will add the documents and make the pull request.\n",
    "\n",
    "To this folder, add `rag_dataset.json` (it must be called this), as well as the rest of the source documents, which in our case is the `paul_graham_essay.txt` file.\n",
    "\n",
    "```sh\n",
    "llama_datasets/paul_graham_essay/\n",
    "├── paul_graham_essay.txt\n",
    "└── rag_dataset.json\n",
    "```\n",
    "\n",
    "Now, simply `git add`, `git commit` and `git push` your branch, and make your PR."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
